REJOINDER TO DISCUSSIONS OF “FREQUENTIST COVERAGE 
OF ADAPTIVE NONPARAMETRIC BAYESIAN CREDIBLE SETS” 

By Botond Szabó, A. W. van der Vaart and J. H. van Zanten 
TU Eindhoven, Leiden University and University of Amsterdam 

We thank the discussants for their supportive comments and interesting 
observations. Many questions are still open and not all methodological or 
philosophical questions may have an answer. Our reply addresses only a 
subset of questions and is organized by topic. A final section reviews recent 
work. 

1. Hierarchical Bayes credible sets. Our paper considers empirical Bayes 
tuning of the posterior distribution, whereas many Bayesians might prefer 
to use a hierarchical Bayes approach. Ghosal and Rousseau ask whether, or 
conjecture that, the hierarchical Bayes procedure behaves similarly to the 
likelihood-based empirical Bayes procedure. Indeed, we can show exactly 
the same coverage of hierarchical Bayes credible sets for polished tail truths. 
A counterexample showing that hierarchical Bayes credible sets also do not 
cover without some restriction was already given in [14], while the size of 
such sets follows from [7]. Thus, within the context of our paper there is no 
difference between the two schemes. 

In the hierarchical Bayes approach we endow the regularity hyperparameter 
$\alpha$ with a hyperprior distribution $\lambda$, and then apply an ordinary 
Bayes method with the overall prior, for some upper bound $A$ (possibly 
dependent on $n$), 

\[ \Pi(\cdot) = \int_0^A \Pi_\alpha(\cdot)\,\lambda(\alpha)\,d\alpha. \]

For $\Pi(\cdot \mid X^{(n)})$ the posterior distribution relative to this prior, a 
hierarchical Bayes credible ball centered at the posterior mean $\hat\theta_n$ is 
defined by its radius $\hat r_{n,\gamma}$, which satisfies 

\[ \Pi(\theta: \|\theta - \hat\theta_n\|_2 \le \hat r_{n,\gamma} \mid X^{(n)}) = 1 - \gamma. \tag{1.1} \]

We blow this up a bit, and for $L > 0$ consider 

\[ \hat C_n(L) = \{\theta \in \ell_2: \|\theta - \hat\theta_n\|_2 \le L \hat r_{n,\gamma}\}. \tag{1.2} \]

Under a mild regularity condition on $\lambda$, similar to that in [7], these sets 
cover polished tail truths. 
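
The construction in (1.1) and (1.2) can be illustrated by a small Monte Carlo 
sketch in Python. It assumes the sequence model of the original paper, 
$X_i = \kappa_i\theta_i + n^{-1/2}Z_i$ with $\kappa_i = i^{-p}$, and the priors 
$N(0, i^{-1-2\alpha})$ on the coefficients; this model is not restated in the 
rejoinder. The exponential hyperprior, the grid over $\alpha$, the truncation to 
finitely many coordinates and all function and parameter names are illustrative 
choices made here, not part of the results. 

```python
import numpy as np


def hierarchical_bayes_ball(x, n, p=1.0, A=5.0, gamma=0.05, L=1.0,
                            grid_size=200, num_draws=2000, seed=0):
    """Monte Carlo sketch of the hierarchical Bayes credible ball (1.1)-(1.2).

    Assumptions (illustrative): x[i-1] = kappa_i * theta_i + Z_i / sqrt(n)
    with kappa_i = i**(-p), prior theta_i ~ N(0, i**(-1 - 2*alpha)), and a
    hyperprior lambda(alpha) proportional to exp(-alpha) on (0, A].
    """
    rng = np.random.default_rng(seed)
    i = np.arange(1, len(x) + 1, dtype=float)
    kappa = i ** (-p)
    alphas = np.linspace(A / grid_size, A, grid_size)

    # Prior variances for every grid value of alpha, shape (grid_size, N).
    prior_var = i[None, :] ** (-1.0 - 2.0 * alphas[:, None])

    # Under Pi_alpha the data are marginally N(0, kappa_i^2 * prior_var + 1/n).
    marg_var = kappa**2 * prior_var + 1.0 / n
    log_lik = -0.5 * np.sum(np.log(marg_var) + x**2 / marg_var, axis=1)
    log_post = log_lik - alphas              # add the log hyperprior exp(-alpha)
    w = np.exp(log_post - log_post.max())
    w /= w.sum()                             # discretized lambda(alpha | x)

    # Conjugate Gaussian posterior of theta for each fixed alpha.
    post_var = prior_var / (1.0 + n * kappa**2 * prior_var)
    post_mean = n * kappa * x * post_var

    center = w @ post_mean                   # hierarchical posterior mean

    # Draw alpha from its posterior, then theta from Pi_alpha(. | x).
    idx = rng.choice(grid_size, size=num_draws, p=w)
    noise = rng.standard_normal((num_draws, len(x)))
    draws = post_mean[idx] + np.sqrt(post_var[idx]) * noise
    dist = np.linalg.norm(draws - center, axis=1)

    radius = np.quantile(dist, 1.0 - gamma)  # empirical version of (1.1)
    return center, radius, L * radius        # L * radius is the radius in (1.2)
```

A truth $\theta_0$ is then covered by the blown-up ball when 
`np.linalg.norm(theta0 - center)` does not exceed the third returned value. 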


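Theorem 1.1. Suppose that there exist $c_1, c_2 \ge 0$, $c_3$ and $c_4, c_5 > 0$, with 
$c_3 > 1$ if $c_2 = 0$, such that 
$c_4^{-1}\alpha^{-c_3}\exp(-c_2\alpha) \le \lambda(\alpha) \le c_4\alpha^{-c_3}\exp(-c_2\alpha)$ 
for all $\alpha \ge c_1$, and $\lambda(\alpha) \ge c_5$ for all $0 < \alpha < c_1$. Then for any 
positive $A, L_0, N_0$ there exists a constant $L$ such that 

\[ \inf_{\theta_0 \in \Theta_{pt}(L_0)} P_{\theta_0}\bigl(\theta_0 \in \hat C_n(L)\bigr) \to 1. \tag{1.3} \]

Furthermore, for $A = A_n \le \sqrt{\log n}/(4\sqrt{\log\rho \vee e})$ this is true with a slowly 
varying sequence $L = L_n$ [$L_n \lesssim (3\rho^{3(1+2p)})^{A_n}$ works]. 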


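Proof. The probability of interest $P_{\theta_0}(\|\hat\theta_n - \theta_0\|_2 \le L\hat r_{n,\gamma})$ is bounded 
below by 

\[ P_{\theta_0}\bigl(\|\theta_0 - E_{\theta_0}\hat\theta_{n,\bar\alpha_n}\|_2 + \|\hat\theta_{n,\bar\alpha_n} - E_{\theta_0}\hat\theta_{n,\bar\alpha_n}\|_2 + \|\hat\theta_n - \hat\theta_{n,\bar\alpha_n}\|_2 \le L\hat r_{n,\gamma}\bigr). \]

Therefore, the theorem follows from Theorem 5.1 of [16], if 

\[ \|\theta_0 - E_{\theta_0}\hat\theta_{n,\bar\alpha_n}\|_2 \lesssim n^{-\bar\alpha_n/(1+2\bar\alpha_n+2p)}, \]

\[ P_{\theta_0}\bigl(\|\hat\theta_{n,\bar\alpha_n} - E_{\theta_0}\hat\theta_{n,\bar\alpha_n}\|_2 \lesssim n^{-\bar\alpha_n/(1+2\bar\alpha_n+2p)}\bigr) \to 1, \]

\[ P_{\theta_0}\bigl(\|\hat\theta_n - \hat\theta_{n,\bar\alpha_n}\|_2 \lesssim n^{-\bar\alpha_n/(1+2\bar\alpha_n+2p)}\bigr) \to 1, \tag{1.4} \]

\[ P_{\theta_0}\bigl(\hat r_{n,\gamma} \ge C_3 n^{-\bar\alpha_n/(1+2\bar\alpha_n+2p)}\bigr) \to 1. \tag{1.5} \]

The first two assertions follow immediately from (5.8) and (5.9) of [16]. 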
For the proof of (1.5), we first note that, for any given $C_3 > 0$, 

\[ \Pi\bigl(\theta: \|\theta - \hat\theta_n\|_2 \le C_3 n^{-\bar\alpha_n/(1+2\bar\alpha_n+2p)} \mid X^{(n)}\bigr) = \int_{\underline\alpha_n}^{\bar\alpha_n} \Pi_\alpha\bigl(\theta: \|\theta - \hat\theta_n\|_2 \le C_3 n^{-\bar\alpha_n/(1+2\bar\alpha_n+2p)} \mid X^{(n)}\bigr)\,\lambda(\alpha \mid X^{(n)})\,d\alpha + o_{P_{\theta_0}}(1). \]

The right side becomes bigger if we replace $\hat\theta_n$ by $\hat\theta_{n,\alpha}$, as the latter is the 
center of the Gaussian distribution $\Pi_\alpha(\cdot \mid X^{(n)})$, and again bigger if we replace 
$\bar\alpha_n$ in the rate inside the probability by $\alpha$. From the proof of (5.7) of [16], it 
follows that there exists a constant $C_3$ such that for every $\alpha$, 

\[ \Pi_\alpha\bigl(\theta: \|\theta - \hat\theta_{n,\alpha}\|_2 \le C_3 n^{-\alpha/(1+2\alpha+2p)} \mid X^{(n)}\bigr) \le 1 - 2\gamma. \]

Then the integral in the preceding display is asymptotically smaller than 
$1 - 2\gamma$, whence (1.5) follows by the definition of $\hat r_{n,\gamma}$. 

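To prove (1.4), we proceed similarly to the proof of (4.4) of [14]. By 
Jensen’s inequality, 

\[ \|\hat\theta_n - \hat\theta_{n,\bar\alpha_n}\|_2^2 \le \int \|\hat\theta_{n,\alpha} - \hat\theta_{n,\bar\alpha_n}\|_2^2\,\lambda(\alpha \mid X^{(n)})\,d\alpha \le \sup_{\alpha \in [\underline\alpha_n,\bar\alpha_n]} \|\hat\theta_{n,\bar\alpha_n} - \hat\theta_{n,\alpha}\|_2^2 + \sup_{\alpha} \|\hat\theta_{n,\bar\alpha_n} - \hat\theta_{n,\alpha}\|_2^2 \int_{\alpha \notin [\underline\alpha_n,\bar\alpha_n]} \lambda(\alpha \mid X^{(n)})\,d\alpha. \tag{1.6} \]

We separately bound the two terms on the right side. First, as 
$\bar\alpha_n \in [\underline\alpha_n, \bar\alpha_n]$, by several applications of the triangle inequality, 

\[ \sup_{\alpha \in [\underline\alpha_n,\bar\alpha_n]} \|\hat\theta_{n,\bar\alpha_n} - \hat\theta_{n,\alpha}\|_2 \le 2 \sup_{\alpha \in [\underline\alpha_n,\bar\alpha_n]} \|E_{\theta_0}\hat\theta_{n,\alpha} - \theta_0\|_2 + 2 \sup_{\alpha \in [\underline\alpha_n,\bar\alpha_n]} \|\hat\theta_{n,\alpha} - E_{\theta_0}\hat\theta_{n,\alpha}\|_2. \]

As a consequence of (5.8) and (5.9) of [16], this is bounded above by a 
multiple of $n^{-\bar\alpha_n/(1+2\bar\alpha_n+2p)}$ with $P_{\theta_0}$-probability tending to one. For the 
second term, we first note that, similar to the preceding display, with 
probability tending to one, 

\[ \sup_{\alpha} \|\hat\theta_{n,\bar\alpha_n} - \hat\theta_{n,\alpha}\|_2 \le 2 \sup_{\alpha} \|E_{\theta_0}\hat\theta_{n,\alpha} - \theta_0\|_2 + 2 \sup_{\alpha} \|\hat\theta_{n,\alpha} - E_{\theta_0}\hat\theta_{n,\alpha}\|_2. \]

As a consequence of (5.10) and (5.11) of [16], this is uniformly bounded 
by a constant times $\|\theta_0\|_2 + 1$ with $P_{\theta_0}$-probability tending to one. 
Furthermore, in view of Section 7 of [7], 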



<2e-(C'4nl/(l+2-n+2p))/(i+2«„-r2p)^^Qg^^C5gC'6ev’^/3 

^ 2e-(C'4nl/il+^“"+2Pb/(2(l-H2a„-h2p)) 

< ^-(2o„)/(l-|-2o„-|-2p) 

Therefore, by Markov’s inequality, the second term on the right-hand side of 
(1.6) is bounded above by a multiple of $n^{-2\bar\alpha_n/(1+2\bar\alpha_n+2p)}$, which is smaller 
than the same rate at $\underline\alpha_n$, with $P_{\theta_0}$-probability tending to one. □ 

For the adaptive size we note that similar to the proof of assertion (4.5) 
of [14], it can be shown that there exists a positive constant Cy such that 

Peo{rn,-y < ^ 1 . 
Then following [16], we get the rate adaptive size for Sobolev balls, 
hyperrectangles, analytic balls, etc. 

2. Shape of the credible sets: Bands versus balls. All discussants pointed 
out that $L_2$-confidence sets are harder to visualize than confidence bands, 
that is, $L_\infty$-balls. We fully agree. See our remarks on plotting below. 

We chose to consider $L_2$-balls because they fit naturally in our inverse 
problem setup and can be studied theoretically with reasonable ease. At the 
same time, we believe that they provide an accurate (or at least not 
misleading) rendering of the general phenomena surrounding adaptive credible 
sets. We fully agree that it is of interest to work out similar results for other 
norms and other situations. 

One of the theoretical difficulties in handling credible bands is describing 
the $L_\infty$-norm in terms of quantities that are controllable under the prior 
and posterior. Several authors (starting with [4]) have recently obtained 
contraction rates in this norm, and their work may well be extendible to 
adaptive credible sets. 

Ghosal proves a rate of contraction for the uniform norm, for parameters 
such that < $\infty$ and a prior that depends on $\alpha$. He next argues 
heuristically that the resulting credible sets with an adaptive choice of $\alpha$ will 
cover relative to the uniform norm. This is possible, but the particular 
empirical Bayes $\hat\alpha$ from our paper may for many true parameters not estimate 
Ghosal’s $\alpha$, but a different value. 

We have encountered similar phenomena when deriving contraction rates 
and credible intervals for (not necessarily continuous) linear functionals of 
the parameter; see [13]. Since point evaluations are linear functionals, such 
credible intervals can be glued together into $L_\infty$-credible bands, where due 
to the Gaussianity one would expect at most a logarithmic factor to be 
necessary to pass from pointwise to simultaneous intervals. A difficulty is that 
Sobolev regularity is not the most useful concept when estimating a function 
at a point; one would like to employ a Hölder norm. In the worst case one 
loses 1/2 in regularity when passing from Sobolev to Hölder, and this loss was 
seen to be real for the minimax contraction rate in [7, 8]. The likelihood-based 
empirical Bayes method seems to “estimate” the Sobolev regularity of the 
truth. In [13] we have shown that coverage can be retained by subtracting 
1/2 from the estimate, thus under-smoothing the empirical Bayes posterior 
distribution. In forthcoming work with Sniekers, we note that the ordinary 
empirical Bayes procedure may still give good coverage for many true 
parameters, the loss of 1/2 being really a worst case comparison of the two 
norms and coverage being connected to more subtle properties of the true 
parameter. 
3. Simulation and plotting. We included some pictures in the paper, and 
we feel that they nicely illustrate the limitations and strengths of adaptive 
Bayesian credible sets. Each picture plots, curve by curve, the 95% of 2000 
draws from the posterior distribution that are closest to the posterior mean. 
Within the resolution of the pictures these curves form a ragged grey band, 
and it is tempting to view this as a confidence band. 

We may have misled the reader to think that the pictures show the 
$L_2$-credible set that we study theoretically in the paper. However, as already 
noted, $L_2$-balls are difficult to plot. To relate our plots to these balls, it 
seems one would have to “visually compute” the $L_2$-distance of the plotted 
curves to the center of the band (the posterior mean), take the maximum 
distance, and compare this to the $L_2$-distance of a tentative function to this 
center, in order to see whether this function is in the ball. This is hard to 
do. The pictures are not formal credible bands either. Still, they manage to 
give an impression of where the posterior distribution puts its mass. 
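
The selection rule behind the pictures, and the comparison that is hard to do 
by eye, can be written out numerically. The Python sketch below works with 
basis coefficients, so that the Euclidean distance of coefficient vectors equals 
the $L_2$-distance of the curves in an orthonormal basis. It assumes the 
sequence model of the original paper with $\kappa_i = i^{-p}$ and a fixed 
regularity `alpha`; the function names, the truncation level and the default 
values are hypothetical choices for illustration only. 

```python
import numpy as np


def closest_posterior_draws(x, n, alpha, p=1.0, num_draws=2000, keep=0.95,
                            seed=0):
    """Draw from Pi_alpha(. | x) and keep the draws closest to the mean.

    Assumptions (illustrative): x[i-1] = kappa_i * theta_i + Z_i / sqrt(n),
    kappa_i = i**(-p), prior theta_i ~ N(0, i**(-1 - 2*alpha)).
    Returns the posterior mean, the kept draws and their l2-distances to
    the mean, sorted from closest to farthest.
    """
    rng = np.random.default_rng(seed)
    i = np.arange(1, len(x) + 1, dtype=float)
    kappa = i ** (-p)
    prior_var = i ** (-1.0 - 2.0 * alpha)

    # Conjugate Gaussian posterior, coordinate by coordinate.
    post_var = prior_var / (1.0 + n * kappa**2 * prior_var)
    post_mean = n * kappa * x * post_var

    noise = rng.standard_normal((num_draws, len(x)))
    draws = post_mean + np.sqrt(post_var) * noise
    dist = np.linalg.norm(draws - post_mean, axis=1)

    # Keep the requested fraction of draws with the smallest distance.
    order = np.argsort(dist)[: int(round(keep * num_draws))]
    return post_mean, draws[order], dist[order]


def in_plotted_ball(candidate, post_mean, kept_dist):
    """Check whether a candidate lies within the largest kept distance."""
    return np.linalg.norm(candidate - post_mean) <= kept_dist.max()
```

With the defaults, 1900 of the 2000 draws are kept, and `in_plotted_ball` 
performs the check described above for a tentative coefficient vector. 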

Low and Ma describe this difficulty very accurately. In particular, our 
choice of making exactly 2000 draws was rather arbitrary and, indeed, at 
other places we have also produced pictures showing just 20 draws (without 
rejecting any). All these pictures seem to illustrate the effect of the 
bias-variance trade-off, and its possible failure, on credible sets reasonably well. 

Low and Ma also suggest a method for constructing $L_\infty$-confidence bands 
from the $L_2$-credible balls and apply it to the adaptive posterior 
distribution. Bayesians will be delighted to see that the empirical Bayes method 
performs satisfactorily in their simulation study. The new concept of 
coverage introduced by these authors, together with Cai, is interesting. 

Castillo also addresses the discrepancy between our analytic definition of 
a credible ball and our small simulation study. He points out that the radius 
can be simulated more precisely. He also suggests that simulating curves 
from distributions that are rougher than the posterior might be useful to 
fill out the gap between the support of the posterior and the ball. This is 
an interesting suggestion, but we would be reluctant to simulate from 
distributions other than the posterior distribution. We imagine that the 
posterior could be queried in many ways, for example, to produce bands, 
intervals for specific functionals, or perhaps even statements on qualitative 
aspects of parameters, but we would support the Bayesian view that the 
posterior distribution gives a full report of the analysis. 

Nickl and Castillo [3] have introduced an approach toward credible sets 
based on a nonparametric Bernstein-von Mises theorem. Nickl writes that he is 
“unsure to which extent $\ell_2$-credible balls are applied in current practice as 
claimed in the introduction of (our paper),” and next suggests that 
“Practitioners may prefer (...) to compute credible balls in H-spaces.” Castillo 
wonders about our opinion that “no method that avoids dealing with the 
bias-variance trade-off will properly quantify the uncertainty.” We do not 
believe we have claimed that $\ell_2$-balls are routine in practice; if we did, then 
we retract that claim here. We do claim that posterior distributions are 
routinely used for uncertainty quantification, often by simulating from them. 
A main finding of 50 years of nonparametric statistics, theory and practice, 
is that the bias-variance trade-off drives everything, setting it apart from 
classical, parametric statistics, which deals mostly with variance, as bias is 
negligible, particularly in the large-sample limit. The $\ell_2$-setting of our paper 
incorporates the bias-variance trade-off, and hence we believe that our 
theoretical results are relevant. It appears that Nickl and Castillo’s 
“Bernstein-von Mises theorem in H-space” removes bias, essentially by parameterizing 
the function as a collection of smooth functionals that can be estimated as 
the parameters in classical parametric models, with neglect of bias. Their 
work is very intriguing and pretty. However, as it explains away bias, we 
found it difficult to believe that it solves the nonparametric problem. It is 
still more intriguing that pictures by Ray in [9], which are based on the 
H-spaces of [3], look similar to ours. Possibly that is because these pictures do 
not show their suggested set, just as our pictures are deficient in this sense. 
This deserves further investigation. 


4. Other priors. The discussants pose the question whether our results 
extend to priors other than the $N(0, i^{-1-2\alpha})$-priors in our paper. We believe 
the answer is affirmative: it appears that the polished tail condition is not 
linked to the form of the priors. 

One reason to believe this is provided by preliminary results, of ourselves and 
in a forthcoming thesis of Sniekers at Leiden University, about priors of the form 

\[ \Pi_\tau = \bigotimes_{i=1}^{\infty} N(0, \tau^2 i^{-1-2\alpha}), \]

where $\alpha$ is fixed, but $\tau$ is adapted to the data, by either an empirical or 
hierarchical Bayes method. For empirical Bayes we plug the marginal maximum 
likelihood estimator $\hat\tau_n$ of $\tau$ into the posterior distributions for fixed 
$\tau$, and construct adaptive credible sets of the form 

\[ \hat C_n(L) = \{\theta \in \ell_2: \|\theta - \hat\theta_n(\hat\tau_n)\|_2 \le L r_{n,\gamma}(\hat\tau_n)\}, \tag{4.1} \]

where $\hat\theta_n(\tau)$ is the posterior mean and $r_{n,\gamma}(\tau)$ satisfies 

\[ \Pi_\tau\bigl(\theta: \|\theta - \hat\theta_n(\tau)\|_2 \le r_{n,\gamma}(\tau) \mid X^{(n)}\bigr) = 1 - \gamma. \tag{4.2} \]
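
The empirical Bayes recipe in (4.1) and (4.2) can be sketched in Python as 
follows. The sketch assumes observations $X_i = \kappa_i\theta_i + n^{-1/2}Z_i$ 
with $\kappa_i = i^{-p}$ (with $p = 0$ the direct problem), maximizes the 
marginal likelihood over a finite grid of scalings, and approximates the radius 
by Monte Carlo. The grid, the truncation to finitely many coordinates and all 
names are hypothetical choices for illustration, not the procedure analyzed in 
the cited work. 

```python
import numpy as np


def empirical_bayes_scaling_ball(x, n, alpha, p=0.0, gamma=0.05, L=1.0,
                                 tau_grid=None, num_draws=5000, seed=0):
    """Sketch of the credible ball (4.1) with a plug-in scaling tau.

    Assumptions (illustrative): x[i-1] = kappa_i * theta_i + Z_i / sqrt(n),
    kappa_i = i**(-p), prior theta_i ~ N(0, tau**2 * i**(-1 - 2*alpha)).
    """
    rng = np.random.default_rng(seed)
    i = np.arange(1, len(x) + 1, dtype=float)
    kappa = i ** (-p)
    base = i ** (-1.0 - 2.0 * alpha)
    if tau_grid is None:
        tau_grid = np.geomspace(1e-2, 1e2, 400)

    # Marginally, x_i ~ N(0, kappa_i^2 * tau^2 * i^(-1-2*alpha) + 1/n).
    marg_var = kappa**2 * tau_grid[:, None] ** 2 * base + 1.0 / n
    log_lik = -0.5 * np.sum(np.log(marg_var) + x**2 / marg_var, axis=1)
    tau_hat = tau_grid[np.argmax(log_lik)]   # marginal MLE over the grid

    # Gaussian posterior for the plugged-in scaling.
    prior_var = tau_hat**2 * base
    post_var = prior_var / (1.0 + n * kappa**2 * prior_var)
    post_mean = n * kappa * x * post_var

    # Under the posterior, ||theta - post_mean||_2^2 = sum_i post_var_i * Z_i^2.
    z2 = rng.standard_normal((num_draws, len(x))) ** 2
    radius = np.sqrt(np.quantile(z2 @ post_var, 1.0 - gamma))
    return tau_hat, post_mean, radius, L * radius
```

The last returned value is the radius of the blown-up ball in (4.1); coverage 
of a given $\theta_0$ is checked by comparing 
`np.linalg.norm(theta0 - post_mean)` with it. 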


Theorem 4.1. For any $A, L_0, N_0$ there exists a constant $L$ such that 

\[ \inf_{\theta_0 \in \Theta_{pt}(L_0)} P_{\theta_0}\bigl(\theta_0 \in \hat C_n(L)\bigr) \to 1. \]
Hierarchical Bayes credible sets will similarly cover. However, these sets 
have the disadvantage that they may be unnecessarily big. In our paper [15] 
we proved that the corresponding posterior distributions contract at the 
minimax rate over Sobolev balls of regularity $\beta < \alpha + 1/2$, but only at the 
suboptimal rate $n^{-(1+2\alpha)/(4+4\alpha)}$ if $\beta \ge \alpha + 1/2$. The latter suboptimal rate 
is partially due to the variance of the posterior distribution, and hence, in 
the case that $\beta > \alpha + 1/2$, the radius of the credible balls will be suboptimal 
as well. 

5. Choice of basis. Rousseau and Castillo point out that the polished 
tail condition is dependent on the chosen basis, whereas one might hope or 
expect the set of “good behaving” true parameters not to depend on the 
basis. 

In inverse problems the eigenbasis of the operator K*K plays, implicitly 
or explicitly, an important role to describe the problem [1, 6, 8, 10] and, 
hence, it is natural to assume the polished tail condition with respect to this 
basis. Other bases were explored in recent work [5], but a good link between 
the operator and the prior seems always needed. 

In “direct problems” one can consider any basis. This then determines 
both the prior and the polished tail condition. The prior, or rather collection 
of priors, will be chosen to model a scale of models that is thought to capture 
the true parameter. In our situation these were Sobolev spaces, which are 
naturally described in a basis. That the polished tail condition will adopt 
the same basis seems not unnatural. After all, “good-behaving” is not an 
absolute property of a parameter, but is relative to a method, which is the 
one induced by the prior in this case. 

There is good scope for extensions to other models and priors. In our case 
the coefficients could be modeled differently than independent and Gaussian, 
although both seem natural. We imagine that results similar to those in our 
paper can easily be written down for double-indexed bases, such as wavelets, thus 
moving closer to the earliest works on self-similarity. More challenging will 
be priors such as Dirichlet mixtures, which are known to adapt to the 
bandwidth in the (normal) kernel. What can be said about their coverage? 

6. Further references. The paper [9] derives an adaptive and nonparametric 
version of the Bernstein-von Mises theorem, using techniques developed 
in [3] and [7], under a self-similarity restriction, and next applies 
this result to construct adaptive credible sets. The same paper also considers 
spike and slab priors and $L_\infty$-credible bands. The author of [2] 
investigates credible sets from an oracle perspective. He considers truncated 
(finite dimensional) Gaussian priors and shows that the empirical Bayes 
approach chooses the optimal truncation level under a (slightly) extended 
version of the polished tail condition. This family of priors is relatively wide 
and contains a member that attains the minimax posterior contraction rate 
for every regularity class. The authors of [12] have followed up their 
work by investigating adaptive pointwise credible sets using rescaled 
(integrated) Brownian motion as a prior in the nonparametric regression model. 
Random smoothing spline priors with Gaussian weights on the spline 
coefficients are shown in [11] to give honest credible sets in the nonparametric 
regression problem under the self-similarity condition. 